
[WIP] Bump allocator for rustc #38725

Closed · wants to merge 6 commits

Conversation

mattico
Contributor

@mattico mattico commented Dec 31, 2016

Inspired by Dlang's allocator, I stuck a bump allocator into rustc to see what would happen. No good benchmarks yet because I need to think about how to do them. Lots of work to be done, but posting now because there should probably be discussion about whether or not this is a Good Idea. For now this will tease:

ripgrep compile:
BEFORE: 53.52s
AFTER: 44.38s

TODO:

  • Memory use profile
  • Better benchmarks
  • Less hackish way to inject
  • Rename from Frame to Bump probably
  • Expose as unstable in std?
  • Is this even a Good Idea?
  • Tidy
  • Rebase
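
The core of it is just a bump pointer over big chunks. A rough single-threaded sketch of the idea (illustrative only; CHUNK_SIZE, the std::alloc refill, and the missing error handling are not this PR's actual code):

use std::alloc::{alloc, Layout};
use std::ptr;

const CHUNK_ALIGN: usize = 4096;
const CHUNK_SIZE: usize = 4 * 1024 * 1024; // illustrative chunk size

static mut HEAP: *mut u8 = ptr::null_mut();
static mut HEAP_LEFT: usize = 0;

// Bump-allocate `size` bytes aligned to `align`; nothing is ever freed.
// Assumes align <= CHUNK_ALIGN and size <= CHUNK_SIZE; OOM handling elided.
pub unsafe fn allocate(size: usize, align: usize) -> *mut u8 {
    // Round the bump pointer up to the requested alignment.
    let aligned = (HEAP as usize + align - 1) & !(align - 1);
    let padding = aligned - HEAP as usize;
    if padding + size > HEAP_LEFT {
        // Out of room: grab a fresh chunk and bump from its base,
        // which is already CHUNK_ALIGN-aligned.
        let layout = Layout::from_size_align(CHUNK_SIZE, CHUNK_ALIGN).unwrap();
        HEAP = alloc(layout);
        HEAP_LEFT = CHUNK_SIZE;
        let p = HEAP;
        HEAP = HEAP.add(size);
        HEAP_LEFT -= size;
        return p;
    }
    HEAP = (aligned as *mut u8).add(size);
    HEAP_LEFT -= padding + size;
    aligned as *mut u8
}

Frees are no-ops, which is the whole trick: rustc is a short-lived process, so the OS can reclaim everything at exit.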

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

const CHUNK_ALIGN: usize = 4096;

static mut HEAP: *mut u8 = ptr::null_mut();
static mut HEAP_LEFT: usize = 0;

Contributor

These variables are global and there's no implicit lock around allocation calls, so the implementation of __rust_allocate below is full of data races. Maybe these should be atomics or thread local?

Contributor Author

Oh, yep. I had implemented this using atomics the first time I wanted to do this (but got stuck getting the allocator to inject). I'll go dig that code up.

@alexcrichton
Member

This seems like it's mostly related to the compiler, so

r? @eddyb

@rust-highfive rust-highfive assigned eddyb and unassigned alexcrichton Dec 31, 2016
}

let heap = HEAP.load(Ordering::SeqCst);
let heap_left = HEAP_LEFT.load(Ordering::SeqCst);

Contributor

Well, this eliminates the instant UB due to data races, but it's still a logical race. Suppose a context switch happens after these two loads have been executed but before the following lines are executed, and another thread allocates. When the first thread continues, it will return the same address as the other thread did.
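
To sketch the shape of a fix (illustrative only, not this PR's code): publish the new cursor with a single compare_exchange and retry on conflict, so two threads can never be handed the same address:

use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative statics: the bump cursor and the end of the current chunk.
static CURSOR: AtomicUsize = AtomicUsize::new(0);
static END: AtomicUsize = AtomicUsize::new(0);

fn try_bump(size: usize, align: usize) -> Option<*mut u8> {
    loop {
        let cur = CURSOR.load(Ordering::SeqCst);
        let aligned = (cur + align - 1) & !(align - 1);
        let new = aligned.checked_add(size)?;
        if new > END.load(Ordering::SeqCst) {
            return None; // chunk exhausted; the refill path still needs a lock
        }
        // Only the thread that advances the cursor from `cur` gets `aligned`;
        // a thread that lost the race sees Err and retries with a fresh load.
        if CURSOR
            .compare_exchange(cur, new, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
        {
            return Some(aligned as *mut u8);
        }
    }
}

The slow path (allocating a new chunk) still needs mutual exclusion, but the common case no longer has a window between the load and the store.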

Contributor Author

Makes sense. I have little experience with atomics, as you may have guessed. I would rather use a Mutex or thread_local!, but neither of those is available in core. I'll make it use a spinlock, which should hopefully be correct.
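
Something like this, as a minimal sketch (not the actual implementation; names are illustrative):

use core::sync::atomic::{AtomicBool, Ordering};

static LOCK: AtomicBool = AtomicBool::new(false);

// Run `f` with the allocator's globals protected by a simple spinlock.
fn with_heap_lock<R>(f: impl FnOnce() -> R) -> R {
    // Spin until we flip the flag from false to true.
    while LOCK
        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        core::hint::spin_loop();
    }
    let result = f();
    LOCK.store(false, Ordering::Release);
    result
}

The existing load/bump/store sequence would just run inside with_heap_lock, so every allocation serializes on that one flag.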

@@ -137,6 +137,12 @@
# Whether or not jemalloc is built with its debug option set
#debug-jemalloc = false

# Whether or not the frame allocator is built and enabled in std
#use-alloc-frame = false

Contributor

I don't see where this option is used. Is the doc out of date?

Contributor Author

This is still TODO

@@ -13,6 +13,7 @@ crate-type = ["dylib", "rlib"]
alloc = { path = "../liballoc" }
alloc_jemalloc = { path = "../liballoc_jemalloc", optional = true }
alloc_system = { path = "../liballoc_system" }
alloc_frame = { path = "../liballoc_frame" }

Contributor

@brson brson Dec 31, 2016

This makes all platforms require an alloc_frame implementation, doesn't it? I don't have much of an opinion about that; it doesn't look too hard to implement. A better setup would be for this feature to be optional and off by default, so porters don't have to think about it. I don't know whether that's necessary for this PR, but it's something to consider.

Contributor

Actually, hm. Platforms without libc would have a hard time implementing alloc_frame.

cc @jackpot51

Contributor Author

Oh yes, this is meant to be optional.

Contributor

@brson: Redox is using the libc allocator right now, so this would not be an issue.

@brson
Contributor

brson commented Dec 31, 2016

This is a neat exercise. Thanks for submitting it.

@mattico
Contributor Author

mattico commented Dec 31, 2016

One more data point for now; I'll have to do better benchmarks in a few days:
libcore before: 10.3 ± 0.7
libcore after: 10.5 ± 0.5

@eddyb
Member

eddyb commented Dec 31, 2016

cc @rust-lang/compiler


pub unsafe fn allocate(size: usize, align: usize) -> *mut u8 {
if align <= MIN_ALIGN {
libc::malloc(size as libc::size_t) as *mut u8

Member

Is this supposed to interoperate with the regular system allocator? If that's not the case, then why not use a plain sbrk?

Contributor

@hanna-kruppe hanna-kruppe Dec 31, 2016

At the very least, LLVM will be using the system allocator in the same process, won't it? Calling sbrk may interact badly with that. nvm

Contributor Author

Yes, this is probably a good idea. I'll look into using sbrk and VirtualAllocEx or whatever. The current implementation is as simple as possible, just to test out the idea.
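
For the Unix side, a sketch of what grabbing chunks straight from the OS might look like with mmap (illustrative, not in this PR; the flags are the usual private anonymous mapping):

extern crate libc;

use std::ptr;

// Ask the OS for a page-aligned chunk directly, bypassing malloc.
// Returns null on failure; the pages are never unmapped.
unsafe fn alloc_chunk(len: usize) -> *mut u8 {
    let p = libc::mmap(
        ptr::null_mut(),
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    if p == libc::MAP_FAILED {
        ptr::null_mut()
    } else {
        // mmap returns page-aligned memory, so CHUNK_ALIGN (4096) holds for free.
        p as *mut u8
    }
}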

@hanna-kruppe
Contributor

hanna-kruppe commented Dec 31, 2016

@mattico Regarding thread safety in the implementation, I think a spin lock is okay. Most parts of the compilation process aren't parallelized, and those that are mostly happen within LLVM, which won't be using this allocator IIUC. Besides, I really wouldn't be comfortable with a complicated atomics-based implementation, even if nobody could poke holes in it — these things are notoriously tricky and the failure mode (non-deterministic allocator bugs) is really terrible.

Thread locals would have been ideal: since this allocator never frees memory, there are none of the usual complications, and they would neatly solve any concurrency issues. But whatever.

@mattico
Contributor Author

mattico commented Dec 31, 2016

@hanna-kruppe The current spinlock implementation does seem to have significant overhead. I'll post benchmarks when I get back to a computer.

I agree about a more complicated atomic implementation. If one were even desired, I definitely shouldn't be the one to write it.

I wish we had thread locals in core :/ Perhaps someday core and std will be modularized enough to use them without needing an allocator. I'll look into what hackery would be required to use them here but I don't expect it'll be worthwhile.
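
For comparison, the std version would be trivial, which is what makes the core restriction annoying; a sketch (alignment handling and chunk refill elided):

use std::cell::Cell;
use std::ptr;

thread_local! {
    // Each thread gets its own bump cursor, so no locks or atomics are needed.
    static HEAP: Cell<*mut u8> = Cell::new(ptr::null_mut());
    static HEAP_LEFT: Cell<usize> = Cell::new(0);
}

fn bump(size: usize) -> *mut u8 {
    HEAP.with(|heap| {
        HEAP_LEFT.with(|left| {
            if left.get() < size {
                return ptr::null_mut(); // would refill a fresh chunk here
            }
            let p = heap.get();
            heap.set(unsafe { p.add(size) });
            left.set(left.get() - size);
            p
        })
    })
}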

@bors
Contributor

bors commented Dec 31, 2016

☔ The latest upstream changes (presumably #38482) made this pull request unmergeable. Please resolve the merge conflicts.

@nikomatsakis
Contributor

Interesting thought experiment. I'd definitely want to run a wider variety of compiles to compare the performance characteristics (also -- can we measure peak memory usage?)

@pnkfelix pnkfelix added the T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue) and C-enhancement (Category: An issue proposing an enhancement or a PR with one) labels Jan 3, 2017
@mattico
Contributor Author

mattico commented Jan 4, 2017

Profiling is going to have to wait until next week, since performance counters don't work in WSL/VirtualBox, and I can't get MSVC to build. My current computer doesn't have an actual Linux install.

@mattico
Contributor Author

mattico commented Jan 16, 2017

I don't have time to work on this at the moment, so I'm closing this for now just to keep the PR queue clean. I'll revisit this sometime soon.

@mattico mattico closed this Jan 16, 2017
@mattico
Contributor Author

mattico commented Mar 16, 2017

In case anybody is wondering where this went: I tested this on a 4-core machine and the performance degraded to the point that it was a bit slower than jemalloc. I imagine that 4 threads stuck behind a single lock erased any gains we may have had. This is worth revisiting if we ever get thread_local! in core, perhaps via a user-space implementation like https://github.com/Amanieu/thread_local-rs.
